This block of code loads two datasets: COVID-19 vaccination data and COVID-19 deaths data for the United States.
import pandas as pd
# Load the vaccination data
vaccinations_path = 'us_state_vaccinations.csv'
vaccinations_data = pd.read_csv(vaccinations_path)
# Load the deaths data
deaths_path = 'time_series_covid19_deaths_US.csv'
deaths_data = pd.read_csv(deaths_path)
# Display the first few rows of each dataset
vaccinations_data.head(), deaths_data.head()
( date location total_vaccinations total_distributed \
0 2021-01-12 Alabama 78134.0 377025.0
1 2021-01-13 Alabama 84040.0 378975.0
2 2021-01-14 Alabama 92300.0 435350.0
3 2021-01-15 Alabama 100567.0 444650.0
4 2021-01-16 Alabama NaN NaN
people_vaccinated people_fully_vaccinated_per_hundred \
0 70861.0 0.15
1 74792.0 0.19
2 80480.0 NaN
3 86956.0 0.28
4 NaN NaN
total_vaccinations_per_hundred people_fully_vaccinated \
0 1.59 7270.0
1 1.71 9245.0
2 1.88 NaN
3 2.05 13488.0
4 NaN NaN
people_vaccinated_per_hundred distributed_per_hundred \
0 1.45 7.69
1 1.53 7.73
2 1.64 8.88
3 1.77 9.07
4 NaN NaN
daily_vaccinations_raw daily_vaccinations daily_vaccinations_per_million \
0 NaN NaN NaN
1 5906.0 5906.0 1205.0
2 8260.0 7083.0 1445.0
3 8267.0 7478.0 1525.0
4 NaN 7498.0 1529.0
share_doses_used total_boosters total_boosters_per_hundred
0 0.207 NaN NaN
1 0.222 NaN NaN
2 0.212 NaN NaN
3 0.226 NaN NaN
4 NaN NaN NaN ,
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region \
0 84001001 US USA 840 1001.0 Autauga Alabama US
1 84001003 US USA 840 1003.0 Baldwin Alabama US
2 84001005 US USA 840 1005.0 Barbour Alabama US
3 84001007 US USA 840 1007.0 Bibb Alabama US
4 84001009 US USA 840 1009.0 Blount Alabama US
Lat Long_ ... 2/28/23 3/1/23 3/2/23 3/3/23 3/4/23 3/5/23 \
0 32.539527 -86.644082 ... 230 232 232 232 232 232
1 30.727750 -87.722071 ... 724 726 726 726 726 726
2 31.868263 -85.387129 ... 103 103 103 103 103 103
3 32.996421 -87.125115 ... 109 109 109 109 109 109
4 33.982109 -86.567906 ... 261 261 261 261 261 261
3/6/23 3/7/23 3/8/23 3/9/23
0 232 232 232 232
1 726 726 727 727
2 103 103 103 103
3 109 109 109 109
4 261 261 261 261
[5 rows x 1155 columns])
This data preparation process ensures that the date columns are in a consistent format and facilitates the analysis of COVID-19 deaths over time and across different states.
# Convert date columns to datetime for both datasets
vaccinations_data['date'] = pd.to_datetime(vaccinations_data['date'])
# For the deaths data, extract dates from the columns and melt the dataframe
date_columns = deaths_data.columns[12:] # skipping non-date columns
deaths_data_melted = deaths_data.melt(id_vars=['Province_State'], value_vars=date_columns,
var_name='date', value_name='deaths')
deaths_data_melted['date'] = pd.to_datetime(deaths_data_melted['date'])
# Aggregate deaths by state and date
deaths_by_state_date = deaths_data_melted.groupby(['Province_State', 'date']).sum().reset_index()
# Inspect the processed data for deaths
deaths_by_state_date.head()
C:\Users\varsh\AppData\Local\Temp\ipykernel_3264\834874680.py:8: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format. deaths_data_melted['date'] = pd.to_datetime(deaths_data_melted['date'])
| Province_State | date | deaths | |
|---|---|---|---|
| 0 | Alabama | 2020-01-22 | 0 |
| 1 | Alabama | 2020-01-23 | 0 |
| 2 | Alabama | 2020-01-24 | 0 |
| 3 | Alabama | 2020-01-25 | 0 |
| 4 | Alabama | 2020-01-26 | 0 |
The code extracts the minimum and maximum dates from each dataset to determine the range of dates covered. By comparing these date ranges, it allows for assessing whether the datasets cover the same time period or if there are discrepancies. This information is crucial for ensuring data consistency and for conducting further analysis, such as correlating vaccination rates with mortality rates over the same time frame.
# Check the date ranges for both datasets
vaccination_date_range = (vaccinations_data['date'].min(), vaccinations_data['date'].max())
death_date_range = (deaths_by_state_date['date'].min(), deaths_by_state_date['date'].max())
vaccination_date_range, death_date_range
((Timestamp('2020-12-20 00:00:00'), Timestamp('2023-05-10 00:00:00')),
(Timestamp('2020-01-22 00:00:00'), Timestamp('2023-03-09 00:00:00')))
Filtering and merging COVID-19 vaccination and deaths data for overlapping date range
# Filter both datasets to the overlapping date range
start_date = pd.Timestamp('2020-12-20')
end_date = pd.Timestamp('2023-03-09')
filtered_vaccinations = vaccinations_data[(vaccinations_data['date'] >= start_date) & (vaccinations_data['date'] <= end_date)]
filtered_deaths = deaths_by_state_date[(deaths_by_state_date['date'] >= start_date) & (deaths_by_state_date['date'] <= end_date)]
# Merge the datasets on state and date
merged_data = pd.merge(filtered_vaccinations, filtered_deaths, left_on=['location', 'date'], right_on=['Province_State', 'date'], how='inner')
# Renaming some columns for clarity and inspecting the merged dataset
merged_data = merged_data.rename(columns={'location': 'state', 'people_vaccinated': 'total_people_vaccinated'})
merged_data.head()
| date | state | total_vaccinations | total_distributed | total_people_vaccinated | people_fully_vaccinated_per_hundred | total_vaccinations_per_hundred | people_fully_vaccinated | people_vaccinated_per_hundred | distributed_per_hundred | daily_vaccinations_raw | daily_vaccinations | daily_vaccinations_per_million | share_doses_used | total_boosters | total_boosters_per_hundred | Province_State | deaths | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-12 | Alabama | 78134.0 | 377025.0 | 70861.0 | 0.15 | 1.59 | 7270.0 | 1.45 | 7.69 | NaN | NaN | NaN | 0.207 | NaN | NaN | Alabama | 5573 |
| 1 | 2021-01-13 | Alabama | 84040.0 | 378975.0 | 74792.0 | 0.19 | 1.71 | 9245.0 | 1.53 | 7.73 | 5906.0 | 5906.0 | 1205.0 | 0.222 | NaN | NaN | Alabama | 5760 |
| 2 | 2021-01-14 | Alabama | 92300.0 | 435350.0 | 80480.0 | NaN | 1.88 | NaN | 1.64 | 8.88 | 8260.0 | 7083.0 | 1445.0 | 0.212 | NaN | NaN | Alabama | 5945 |
| 3 | 2021-01-15 | Alabama | 100567.0 | 444650.0 | 86956.0 | 0.28 | 2.05 | 13488.0 | 1.77 | 9.07 | 8267.0 | 7478.0 | 1525.0 | 0.226 | NaN | NaN | Alabama | 6030 |
| 4 | 2021-01-16 | Alabama | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7498.0 | 1529.0 | NaN | NaN | NaN | Alabama | 6119 |
import plotly.io as pio
pio.renderers.default = 'notebook'
COVID-19 Vaccinations and Deaths Heatmaps: This code generates heatmaps using Plotly to visualize COVID-19 vaccinations and deaths data aggregated by month and state, providing insights into the distribution of vaccinations and deaths across different states over time.
import pandas as pd
import plotly.graph_objects as go
# Assuming 'merged_data' is your DataFrame and it's already properly loaded and aggregated
# Convert date column to datetime and extract month or appropriate time period
merged_data['date'] = pd.to_datetime(merged_data['date'])
merged_data['month'] = merged_data['date'].dt.strftime('%Y-%m')
# Pivot the data for vaccinations heatmap
vaccinations_pivot = merged_data.pivot_table(values='total_people_vaccinated',
index='state',
columns='month',
aggfunc='max', # Assuming you want the max for the month
fill_value=0)
# Pivot the data for deaths heatmap
deaths_pivot = merged_data.pivot_table(values='deaths',
index='state',
columns='month',
aggfunc='sum', # Assuming you want the sum for the month
fill_value=0)
# Vaccinations Heatmap
fig_vaccinations = go.Figure(data=go.Heatmap(
z=vaccinations_pivot.values,
x=vaccinations_pivot.columns,
y=vaccinations_pivot.index,
colorscale='Viridis'))
fig_vaccinations.update_layout(
title='COVID-19 Vaccinations Heatmap',
xaxis_nticks=36,
yaxis_nticks=36)
# Deaths Heatmap
fig_deaths = go.Figure(data=go.Heatmap(
z=deaths_pivot.values,
x=deaths_pivot.columns,
y=deaths_pivot.index,
colorscale='Viridis'))
fig_deaths.update_layout(
title='COVID-19 Deaths Heatmap',
xaxis_nticks=36,
yaxis_nticks=36)
# Show the figures
fig_vaccinations.show()
fig_deaths.show()
This code creates two Plotly heatmaps to visualize the impact of COVID-19 vaccinations on 'merged_data'. By pivoting the data to separate vaccinations and deaths, it illustrates the total number of people vaccinated and COVID-19-related deaths over time for each state. These visualizations provide insights into the effectiveness of vaccination efforts in reducing fatalities across states.